Inclass5

Overview

Predict the functionality of a water point based on a set of variables. We will build a prediction model using global generalized regression model and geographically weighted generalised regression model and compare the performance of these models using a set of evaluation metrics.

Independent Variables:

There are 2 data sets. One contains Local Government Boundary data and the other contains the water point location.

Independent Variables:

  • Distance to primary road

  • Distance to secondary road

  • Distance to tertiary road

  • Distance to city

  • Distance to town

  • Water Point population

  • Local Population 1KM

  • Usage capacity

  • Urban

  • Water Source Clean

Dependent Variable:

There are only 2 outcomes for the status water point. It is either functional or non-functional. It is a binary variable. Since this violates the linearity assumptions, linear regression cannot be used.

Benefits of Logistic Regression:

Logistic regression allows us to relax a lot of assumptions.

  • Logistic Regression does not assume a linear relationship between dependent and independent variables

  • The independent variables need not be interval nor normally distributed nor linearly related, nor equal variance within each group

Limitations of Logistic Regression:

  • Large Sample of data is needed

    • This is because Logistic regression uses maximum likelihood algorithm which maximises the probability and this requires more data

1.Data Preparation

We load the packages needed and install them if needed. Packages needed are:

  • sf, spdep, tmap, tidyverse, funModeling, shinyjs, cluster, factoextra, heatmaply, ClustGeo, GGally

First, we use pacman to load in the library we need.

pacman::p_load(sf, spdep, tmap, tidyverse, funModeling, shinyjs, cluster, factoextra, heatmaply, ClustGeo, GGally, ggpubr, corrplot, blorr, GWmodel, skimr, caret, stats)

# install.packages("caret", dependencies = TRUE)
library("caret")

Next, we are going to import the data

Osun <- read_rds("rds/Osun.rds")
Osun_wp_sf <- read_rds("rds/Osun_wp_sf.rds")

We take a look at the column data type and its first few instances.

glimpse(Osun)
Rows: 30
Columns: 5
$ ADM2_EN    <chr> "Aiyedade", "Aiyedire", "Atakumosa East", "Atakumosa West",…
$ ADM2_PCODE <chr> "NG030001", "NG030002", "NG030003", "NG030004", "NG030005",…
$ ADM1_EN    <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ ADM1_PCODE <chr> "NG030", "NG030", "NG030", "NG030", "NG030", "NG030", "NG03…
$ geometry   <MULTIPOLYGON [m]> MULTIPOLYGON (((213526.6 34..., MULTIPOLYGON (…
glimpse(Osun_wp_sf)
Rows: 4,760
Columns: 75
$ row_id                     <dbl> 70578, 66401, 65607, 68989, 67708, 66419, 6…
$ source                     <chr> "Federal Ministry of Water Resources, Niger…
$ lat_deg                    <dbl> 7.759448, 8.031187, 7.891137, 7.509588, 7.4…
$ lon_deg                    <dbl> 4.563998, 4.637400, 4.713438, 4.265002, 4.3…
$ report_date                <chr> "05/11/2015 12:00:00 AM", "04/30/2015 12:00…
$ status_id                  <chr> "No", "No", "No", "No", "Yes", "Yes", "Yes"…
$ water_source_clean         <chr> "Borehole", "Borehole", "Borehole", "Boreho…
$ water_source_category      <chr> "Well", "Well", "Well", "Well", "Well", "We…
$ water_tech_clean           <chr> "Mechanized Pump", "Mechanized Pump", "Mech…
$ water_tech_category        <chr> "Mechanized Pump", "Mechanized Pump", "Mech…
$ facility_type              <chr> "Improved", "Improved", "Improved", "Improv…
$ clean_country_name         <chr> "Nigeria", "Nigeria", "Nigeria", "Nigeria",…
$ clean_adm1                 <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ clean_adm2                 <chr> "Osogbo", "Odo-Otin", "Boripe", "Ayedire", …
$ clean_adm3                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ clean_adm4                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ install_year               <dbl> NA, 2004, 2006, 2014, 2011, 2007, 2007, 200…
$ installer                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ rehab_year                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ rehabilitator              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ management_clean           <chr> NA, NA, NA, NA, NA, "Community Management",…
$ status_clean               <chr> "Abandoned/Decommissioned", "Abandoned/Deco…
$ pay                        <chr> "No", "No", "No", "No", "No", "No", "No", "…
$ fecal_coliform_presence    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fecal_coliform_value       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ subjective_quality         <chr> "Acceptable quality", "Acceptable quality",…
$ activity_id                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ scheme_id                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wpdx_id                    <chr> "6FV6QH57+QHH", "6FW62JJP+FXC", "6FV6VPR7+F…
$ notes                      <chr> "COSTAIN  area, Osogbo", "Igbaye", "Araromi…
$ orig_lnk                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ photo_lnk                  <chr> "https://akvoflow-55.s3.amazonaws.com/image…
$ country_id                 <chr> "NG", "NG", "NG", "NG", "NG", "NG", "NG", "…
$ data_lnk                   <chr> "https://catalog.waterpointdata.org/dataset…
$ distance_to_primary_road   <dbl> 167.82235, 4133.71306, 6559.76182, 8260.715…
$ distance_to_secondary_road <dbl> 838.9185, 1162.6246, 4625.0324, 5854.5731, …
$ distance_to_tertiary_road  <dbl> 1181.107236, 9.012647, 58.314987, 2013.6515…
$ distance_to_city           <dbl> 2449.2931, 16704.1923, 21516.8495, 31765.68…
$ distance_to_town           <dbl> 9463.295, 5176.899, 8589.103, 16386.459, 23…
$ water_point_history        <chr> "{\"2015-05-11\": {\"photo_lnk\": \"https:/…
$ rehab_priority             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ water_point_population     <dbl> NA, NA, NA, NA, 0, 508, 162, 362, 3562, 217…
$ local_population_1km       <dbl> NA, NA, NA, NA, 70, 647, 449, 1611, 7199, 2…
$ crucialness_score          <dbl> NA, NA, NA, NA, NA, 0.785162287, 0.36080178…
$ pressure_score             <dbl> NA, NA, NA, NA, NA, 1.69333333, 0.54000000,…
$ usage_capacity             <dbl> 1000, 1000, 1000, 300, 1000, 300, 300, 300,…
$ is_urban                   <lgl> TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALS…
$ days_since_report          <dbl> 2689, 2700, 2688, 2693, 2701, 2692, 2681, 2…
$ staleness_score            <dbl> 42.84384, 42.69554, 42.85735, 42.78986, 42.…
$ latest_record              <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
$ location_id                <dbl> 239556, 230405, 240009, 236400, 229231, 237…
$ cluster_size               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
$ clean_country_id           <chr> "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "…
$ country_name               <chr> "Nigeria", "Nigeria", "Nigeria", "Nigeria",…
$ water_source               <chr> "Improved Tube well or borehole", "Improved…
$ water_tech                 <chr> "Motorised", "Motorised", "Motorised", "yet…
$ status                     <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRU…
$ adm2                       <chr> "Osogbo", "Odo-Otin", "Boripe", "Aiyedire",…
$ adm3                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ management                 <chr> NA, NA, NA, NA, NA, "Community Management",…
$ adm1                       <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ `New Georeferenced Column` <chr> "POINT (4.5639983 7.7594483)", "POINT (4.63…
$ lat_deg_original           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lat_lon_deg                <chr> "(7.7594483°, 4.5639983°)", "(8.0311867°, 4…
$ lon_deg_original           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ public_data_source         <chr> "https://catalog.waterpointdata.org/datafil…
$ converted                  <chr> "#status_id, #water_source, #pay, #status, …
$ count                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ created_timestamp          <chr> "06/30/2020 12:56:07 PM", "06/30/2020 12:56…
$ updated_timestamp          <chr> "06/30/2020 12:56:07 PM", "06/30/2020 12:56…
$ Geometry                   <POINT [m]> POINT (236239.7 417577), POINT (24463…
$ ADM2_EN                    <chr> "Osogbo", "Odo-Otin", "Boripe", "Aiyedire",…
$ ADM2_PCODE                 <chr> "NG030030", "NG030025", "NG030006", "NG0300…
$ ADM1_EN                    <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ ADM1_PCODE                 <chr> "NG030", "NG030", "NG030", "NG030", "NG030"…
Osun_wp_sf %>%
  freq(input = 'status')

  status frequency percentage cumulative_perc
1   TRUE      2642       55.5            55.5
2  FALSE      2118       44.5           100.0

There are 2 categories of status - true and false. 55.5% are true and 44.5% are false.

We use the skim() function to help us with our EDA. It shows us the summary statistics. Regression models are sensitive to missing values. We can identify some of the columns with many missing fields using this method. We will not be using columns where there are too many missing values. If there are just a few, we remove those rows of data point.

Osun_wp_sf %>%
  skim()
Data summary
Name Piped data
Number of rows 4760
Number of columns 75
_______________________
Column type frequency:
character 47
logical 5
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.00 5 44 0 2 0
report_date 0 1.00 22 22 0 42 0
status_id 0 1.00 2 7 0 3 0
water_source_clean 0 1.00 8 22 0 3 0
water_source_category 0 1.00 4 6 0 2 0
water_tech_clean 24 0.99 9 23 0 3 0
water_tech_category 24 0.99 9 15 0 2 0
facility_type 0 1.00 8 8 0 1 0
clean_country_name 0 1.00 7 7 0 1 0
clean_adm1 0 1.00 3 5 0 5 0
clean_adm2 0 1.00 3 14 0 35 0
clean_adm3 4760 0.00 NA NA 0 0 0
clean_adm4 4760 0.00 NA NA 0 0 0
installer 4760 0.00 NA NA 0 0 0
management_clean 1573 0.67 5 37 0 7 0
status_clean 0 1.00 9 32 0 7 0
pay 0 1.00 2 39 0 7 0
fecal_coliform_presence 4760 0.00 NA NA 0 0 0
subjective_quality 0 1.00 18 20 0 4 0
activity_id 4757 0.00 36 36 0 3 0
scheme_id 4760 0.00 NA NA 0 0 0
wpdx_id 0 1.00 12 12 0 4760 0
notes 0 1.00 2 96 0 3502 0
orig_lnk 4757 0.00 84 84 0 1 0
photo_lnk 41 0.99 84 84 0 4719 0
country_id 0 1.00 2 2 0 1 0
data_lnk 0 1.00 79 96 0 2 0
water_point_history 0 1.00 142 834 0 4750 0
clean_country_id 0 1.00 3 3 0 1 0
country_name 0 1.00 7 7 0 1 0
water_source 0 1.00 8 30 0 4 0
water_tech 0 1.00 5 37 0 20 0
adm2 0 1.00 3 14 0 33 0
adm3 4760 0.00 NA NA 0 0 0
management 1573 0.67 5 47 0 7 0
adm1 0 1.00 4 5 0 4 0
New Georeferenced Column 0 1.00 16 35 0 4760 0
lat_lon_deg 0 1.00 13 32 0 4760 0
public_data_source 0 1.00 84 102 0 2 0
converted 0 1.00 53 53 0 1 0
created_timestamp 0 1.00 22 22 0 2 0
updated_timestamp 0 1.00 22 22 0 2 0
Geometry 0 1.00 33 37 0 4760 0
ADM2_EN 0 1.00 3 14 0 30 0
ADM2_PCODE 0 1.00 8 8 0 30 0
ADM1_EN 0 1.00 4 4 0 1 0
ADM1_PCODE 0 1.00 5 5 0 1 0

Variable type: logical

skim_variable n_missing complete_rate mean count
rehab_year 4760 0 NaN :
rehabilitator 4760 0 NaN :
is_urban 0 1 0.39 FAL: 2884, TRU: 1876
latest_record 0 1 1.00 TRU: 4760
status 0 1 0.56 TRU: 2642, FAL: 2118

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 68550.48 10216.94 49601.00 66874.75 68244.50 69562.25 471319.00 ▇▁▁▁▁
lat_deg 0 1.00 7.68 0.22 7.06 7.51 7.71 7.88 8.06 ▁▂▇▇▇
lon_deg 0 1.00 4.54 0.21 4.08 4.36 4.56 4.71 5.06 ▃▆▇▇▂
install_year 1144 0.76 2008.63 6.04 1917.00 2006.00 2010.00 2013.00 2015.00 ▁▁▁▁▇
fecal_coliform_value 4760 0.00 NaN NA NA NA NA NA NA
distance_to_primary_road 0 1.00 5021.53 5648.34 0.01 719.36 2972.78 7314.73 26909.86 ▇▂▁▁▁
distance_to_secondary_road 0 1.00 3750.47 3938.63 0.15 460.90 2554.25 5791.94 19559.48 ▇▃▁▁▁
distance_to_tertiary_road 0 1.00 1259.28 1680.04 0.02 121.25 521.77 1834.42 10966.27 ▇▂▁▁▁
distance_to_city 0 1.00 16663.99 10960.82 53.05 7930.75 15030.41 24255.75 47934.34 ▇▇▆▃▁
distance_to_town 0 1.00 16726.59 12452.65 30.00 6876.92 12204.53 27739.46 44020.64 ▇▅▃▃▂
rehab_priority 2654 0.44 489.33 1658.81 0.00 7.00 91.50 376.25 29697.00 ▇▁▁▁▁
water_point_population 4 1.00 513.58 1458.92 0.00 14.00 119.00 433.25 29697.00 ▇▁▁▁▁
local_population_1km 4 1.00 2727.16 4189.46 0.00 176.00 1032.00 3717.00 36118.00 ▇▁▁▁▁
crucialness_score 798 0.83 0.26 0.28 0.00 0.07 0.15 0.35 1.00 ▇▃▁▁▁
pressure_score 798 0.83 1.46 4.16 0.00 0.12 0.41 1.24 93.69 ▇▁▁▁▁
usage_capacity 0 1.00 560.74 338.46 300.00 300.00 300.00 1000.00 1000.00 ▇▁▁▁▅
days_since_report 0 1.00 2692.69 41.92 1483.00 2688.00 2693.00 2700.00 4645.00 ▁▇▁▁▁
staleness_score 0 1.00 42.80 0.58 23.13 42.70 42.79 42.86 62.66 ▁▁▇▁▁
location_id 0 1.00 235865.49 6657.60 23741.00 230638.75 236199.50 240061.25 267454.00 ▁▁▁▁▇
cluster_size 0 1.00 1.05 0.25 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
lat_deg_original 4760 0.00 NaN NA NA NA NA NA NA
lon_deg_original 4760 0.00 NaN NA NA NA NA NA NA
count 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁

We remove missing values with below code chunk:

Osun_wp_sf_clean <- Osun_wp_sf %>%
  filter_at(vars(status,
                 distance_to_primary_road,
                 distance_to_secondary_road,
                 distance_to_city,
                 distance_to_town,
                 water_point_population,
                 local_population_1km,
                 usage_capacity,
                 is_urban,
                 water_source_clean),
            all_vars(!is.na(.))) %>%
  mutate(usage_capacity = as.factor(usage_capacity))

We convert usage capacity as a category by using as.factor. Data is now categorical instead of numerical.

1.1 Correlation Analysis

To do correlation analysis, it does not recognise data format with geometry. Hence we set it as NULL to remove it.

Osun_wp <- Osun_wp_sf_clean %>%
  select(c(7,35:39, 42:43, 46:47, 57)) %>%
  st_set_geometry(NULL)

We take a subset of the Osun_wp from column 2 to 7.

cluster_vars.cor = cor(
  Osun_wp[,2:7])

corrplot.mixed(cluster_vars.cor,
               lower = "ellipse",
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

None of them have high correlation. There will be no problem of multicollinearity.We can start building out model.

2. Building Global Regression Model

model <- glm(status ~ distance_to_primary_road +
               distance_to_secondary_road +
               distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = Osun_wp_sf_clean,
             family = binomial(link = "logit"))

We check the summary of the model using blr_regress.

blr_regress(model)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4744           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3887        0.1124      3.4588       5e-04 
        distance_to_primary_road            1      0.0000        0.0000     -0.7153      0.4744 
       distance_to_secondary_road           1      0.0000        0.0000     -0.5530      0.5802 
       distance_to_tertiary_road            1      1e-04         0.0000      4.6708      0.0000 
            distance_to_city                1      0.0000        0.0000     -4.7574      0.0000 
            distance_to_town                1      0.0000        0.0000     -4.9170      0.0000 
              is_urbanTRUE                  1     -0.2971        0.0819     -3.6294       3e-04 
           usage_capacity1000               1     -0.6230        0.0697     -8.9366      0.0000 
water_source_cleanProtected Shallow Well    1      0.5040        0.0857      5.8783      0.0000 
   water_source_cleanProtected Spring       1      1.2882        0.4388      2.9359      0.0033 
         water_point_population             1      -5e-04        0.0000    -11.3686      0.0000 
          local_population_1km              1      3e-04         0.0000     19.2953      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7347          Somers' D        0.4693   
% Discordant          0.2653          Gamma            0.4693   
% Tied                0.0000          Tau-a            0.2318   
Pairs                5585188          c                0.7347   
---------------------------------------------------------------

Most of the features are significant predictors to use except for: distance_to_primary_road and distance_to_secondary_road.

Among the significant predictors, there is negative correlation for is_urban_true and usage_capacity1000 feature. Rest of the features have a positive correlation.

2.1 Model Evaluation

blr_confusion_matrix(model, cutoff = 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1301  738
         1   813 1904

                Accuracy : 0.6739 
     No Information Rate : 0.4445 

                   Kappa : 0.3373 

McNemars's Test P-Value  : 0.0602 

             Sensitivity : 0.7207 
             Specificity : 0.6154 
          Pos Pred Value : 0.7008 
          Neg Pred Value : 0.6381 
              Prevalence : 0.5555 
          Detection Rate : 0.4003 
    Detection Prevalence : 0.5713 
       Balanced Accuracy : 0.6680 
               Precision : 0.7008 
                  Recall : 0.7207 

        'Positive' Class : 1

Next, we evaluate the performance of the model. We set the cut off probability as 0.5. If probability outputted by LR model is greater than 0.5, it is a functional water point. Otherwise non-functional water point.

We compare the predicted values against the actual value. The accuracy of model is 67% and sensitivity/ true positive rate is 72% which is a moderate classifying performance.

2.2 Model Improvement

As we have determined the distance to primary road and distance to secondary road are insignificant features, we will remove it from our model building and rerun the summary report to see if there is an improvement.

model_cleaned <- glm(status ~ distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = Osun_wp_sf_clean,
             family = binomial(link = "logit"))

blr_regress(model_cleaned)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4746           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3540        0.1055      3.3541       8e-04 
       distance_to_tertiary_road            1      1e-04         0.0000      4.9096      0.0000 
            distance_to_city                1      0.0000        0.0000     -5.2022      0.0000 
            distance_to_town                1      0.0000        0.0000     -5.4660      0.0000 
              is_urbanTRUE                  1     -0.2667        0.0747     -3.5690       4e-04 
           usage_capacity1000               1     -0.6206        0.0697     -8.9081      0.0000 
water_source_cleanProtected Shallow Well    1      0.4947        0.0850      5.8228      0.0000 
   water_source_cleanProtected Spring       1      1.2790        0.4384      2.9174      0.0035 
         water_point_population             1      -5e-04        0.0000    -11.3902      0.0000 
          local_population_1km              1      3e-04         0.0000     19.4069      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7349          Somers' D        0.4697   
% Discordant          0.2651          Gamma            0.4697   
% Tied                0.0000          Tau-a            0.2320   
Pairs                5585188          c                0.7349   
---------------------------------------------------------------
blr_confusion_matrix(model_cleaned, cutoff = 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1300  743
         1   814 1899

                Accuracy : 0.6726 
     No Information Rate : 0.4445 

                   Kappa : 0.3348 

McNemars's Test P-Value  : 0.0761 

             Sensitivity : 0.7188 
             Specificity : 0.6149 
          Pos Pred Value : 0.7000 
          Neg Pred Value : 0.6363 
              Prevalence : 0.5555 
          Detection Rate : 0.3993 
    Detection Prevalence : 0.5704 
       Balanced Accuracy : 0.6669 
               Precision : 0.7000 
                  Recall : 0.7188 

        'Positive' Class : 1

Accuracy remains at 67% and senstivity/ true positive rate reamins at 72%.

How can we improve the model? We can improve the model by adding geographical weights using the GW model.

3. Model building with Geographical Weights using GWR method

GW model takes in only spatial dataframe format. Hence we need to convert it using the as_Spatial() method.

Osun_wp_sp <- Osun_wp_sf_clean %>%
  select(c(status, distance_to_primary_road,
               distance_to_secondary_road,
               distance_to_tertiary_road,
               distance_to_city,
               distance_to_town,
           water_point_population,
           local_population_1km,
           is_urban,
           usage_capacity,
           water_source_clean)) %>%
  as_Spatial()
Osun_wp_sp
class       : SpatialPointsDataFrame 
features    : 4756 
extent      : 182502.4, 290751, 340054.1, 450905.3  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs 
variables   : 11
names       : status, distance_to_primary_road, distance_to_secondary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, is_urban, usage_capacity, water_source_clean 
min values  :      0,        0.014461356813335,          0.152195902540837,         0.017815121653488, 53.0461399623541, 30.0019777713073,                      0,                    0,        0,           1000,           Borehole 
max values  :      1,         26909.8616132094,           19559.4793799085,          10966.2705628969,  47934.343603562, 44020.6393368124,                  29697,                36118,        1,            300,   Protected Spring 

Next we calculate the distance bandwidth metrics. We use the fixed distance bandwidth metrics here.

bw.fixed <- bw.ggwr(status ~ distance_to_primary_road +
               distance_to_secondary_road +
               distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = Osun_wp_sp,
             family = "binomial",
             approach = "AIC",
             kernel = "gaussian",
             adaptive = FALSE,
             longlat = FALSE)

From above it is found bw.fixed is 2599.672.

bw.fixed <- 2599.672
bw.fixed
[1] 2599.672

The fixed bandwidth distance used will be 2599.672m. We build the geographical weighted generalised model using the ggwr.fixed() method and defining the fixed bandwith distance to use.

gwlr.fixed <- ggwr.basic(status ~ distance_to_primary_road +
               distance_to_secondary_road +
               distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = Osun_wp_sp,
             bw = bw.fixed,
             family = "binomial",
             kernel = "gaussian",
             adaptive = FALSE,
             longlat = FALSE)
 Iteration    Log-Likelihood
=========================
       0        -1958 
       1        -1676 
       2        -1526 
       3        -1443 
       4        -1405 
       5        -1405 
gwlr.fixed
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2022-12-17 14:00:35 
   Call:
   ggwr.basic(formula = status ~ distance_to_primary_road + distance_to_secondary_road + 
    distance_to_tertiary_road + distance_to_city + distance_to_town + 
    is_urban + usage_capacity + water_source_clean + water_point_population + 
    local_population_1km, data = Osun_wp_sp, bw = bw.fixed, family = "binomial", 
    kernel = "gaussian", adaptive = FALSE, longlat = FALSE)

   Dependent (y) variable:  status
   Independent variables:  distance_to_primary_road distance_to_secondary_road distance_to_tertiary_road distance_to_city distance_to_town is_urban usage_capacity water_source_clean water_point_population local_population_1km
   Number of data points: 4756
   Used family: binomial
   ***********************************************************************
   *              Results of Generalized linear Regression               *
   ***********************************************************************

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-124.555    -1.755     1.072     1.742    34.333  

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
Intercept                                 3.887e-01  1.124e-01   3.459 0.000543
distance_to_primary_road                 -4.642e-06  6.490e-06  -0.715 0.474422
distance_to_secondary_road               -5.143e-06  9.299e-06  -0.553 0.580230
distance_to_tertiary_road                 9.683e-05  2.073e-05   4.671 3.00e-06
distance_to_city                         -1.686e-05  3.544e-06  -4.757 1.96e-06
distance_to_town                         -1.480e-05  3.009e-06  -4.917 8.79e-07
is_urbanTRUE                             -2.971e-01  8.185e-02  -3.629 0.000284
usage_capacity1000                       -6.230e-01  6.972e-02  -8.937  < 2e-16
water_source_cleanProtected Shallow Well  5.040e-01  8.574e-02   5.878 4.14e-09
water_source_cleanProtected Spring        1.288e+00  4.388e-01   2.936 0.003325
water_point_population                   -5.097e-04  4.484e-05 -11.369  < 2e-16
local_population_1km                      3.451e-04  1.788e-05  19.295  < 2e-16
                                            
Intercept                                ***
distance_to_primary_road                    
distance_to_secondary_road                  
distance_to_tertiary_road                ***
distance_to_city                         ***
distance_to_town                         ***
is_urbanTRUE                             ***
usage_capacity1000                       ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring       ** 
water_point_population                   ***
local_population_1km                     ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6534.5  on 4755  degrees of freedom
Residual deviance: 5688.0  on 4744  degrees of freedom
AIC: 5712

Number of Fisher Scoring iterations: 5


 AICc:  5712.099
 Pseudo R-square value:  0.1295351
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 2599.672 
   Regression points: the same locations as observations are used.
   Distance metric: A distance matrix is specified for this model calibration.

   ************Summary of Generalized GWR coefficient estimates:**********
                                                   Min.     1st Qu.      Median
   Intercept                                -8.7229e+02 -4.9955e+00  1.7600e+00
   distance_to_primary_road                 -1.9389e-02 -4.8031e-04  2.9618e-05
   distance_to_secondary_road               -1.5921e-02 -3.7551e-04  1.2317e-04
   distance_to_tertiary_road                -1.5618e-02 -4.2368e-04  7.6179e-05
   distance_to_city                         -1.8416e-02 -5.6217e-04 -1.2726e-04
   distance_to_town                         -2.2411e-02 -5.7283e-04 -1.5155e-04
   is_urbanTRUE                             -1.9790e+02 -4.2908e+00 -1.6864e+00
   usage_capacity1000                       -2.0772e+01 -9.7231e-01 -4.1592e-01
   water_source_cleanProtected.Shallow.Well -2.0789e+01 -4.5190e-01  5.3340e-01
   water_source_cleanProtected.Spring       -5.2235e+02 -5.5977e+00  2.5441e+00
   water_point_population                   -5.2208e-02 -2.2767e-03 -9.8875e-04
   local_population_1km                     -1.2698e-01  4.9952e-04  1.0638e-03
                                                3rd Qu.      Max.
   Intercept                                 1.2763e+01 1073.2156
   distance_to_primary_road                  4.8443e-04    0.0142
   distance_to_secondary_road                6.0692e-04    0.0258
   distance_to_tertiary_road                 6.6815e-04    0.0128
   distance_to_city                          2.3718e-04    0.0150
   distance_to_town                          1.9271e-04    0.0224
   is_urbanTRUE                              1.2841e+00  744.3099
   usage_capacity1000                        3.0322e-01    5.9281
   water_source_cleanProtected.Shallow.Well  1.7849e+00   67.6343
   water_source_cleanProtected.Spring        6.7663e+00  317.4133
   water_point_population                    5.0102e-04    0.1309
   local_population_1km                      1.8157e-03    0.0392
   ************************Diagnostic information*************************
   Number of data points: 4756 
   GW Deviance: 2795.084 
   AIC : 4414.606 
   AICc : 4747.423 
   Pseudo R-square value:  0.5722559 

   ***********************************************************************
   Program stops at: 2022-12-17 14:01:50 

Geographically weighted generalized regression works by building multiple generalized linear models for locations in same bandwidth. Each model has their own localized geographical weights.

We observe like in the global generalised linear model, the metrics distance to primary road and distance to secondary road are not significant as they have p values higher than 0.05.

From the summary report above, we can also see the model with geographical weights performs better with a lower AIC . Hence it is a better model with geographical weights.

3.1 Model Comparison

We want to compute a confusion matrix. We need to cover the SDF object into a date frame.

gwr.fixed <- as.data.frame(gwlr.fixed$SDF)

We will label predicted values with probability higher than 0.5 into 1 and else 0. The result will be saved as most column.

gwr.fixed <- gwr.fixed %>%
  mutate(most = ifelse(
    gwr.fixed$yhat >= 0.5, T, F))

There is no easy method to print out confusion report. Hence we create our own using confusion Matrix from caret library using the actual values column and gwr.fixed and the newly appended column “most”.

gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data=gwr.fixed$most, reference = gwr.fixed$y)
CM
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1824  263
     TRUE    290 2379
                                          
               Accuracy : 0.8837          
                 95% CI : (0.8743, 0.8927)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7642          
                                          
 Mcnemar's Test P-Value : 0.2689          
                                          
            Sensitivity : 0.8628          
            Specificity : 0.9005          
         Pos Pred Value : 0.8740          
         Neg Pred Value : 0.8913          
             Prevalence : 0.4445          
         Detection Rate : 0.3835          
   Detection Prevalence : 0.4388          
      Balanced Accuracy : 0.8816          
                                          
       'Positive' Class : FALSE           
                                          

By using geographical weighted model, the sensitivity has improved to 86.3% and accuracy to 88.4%.

3.2 Model Improvement

Can we improve the model if we remove the insignificant variables? We create the distance bandwidth metrics without the “distance to primary road” and “distance to secondary road” metrics.

bw.fixed_cleaned <- bw.ggwr(status ~ distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = Osun_wp_sp,
             family = "binomial",
             approach = "AIC",
             kernel = "gaussian",
             adaptive = FALSE,
             longlat = FALSE)

Lets check what bandwidth we should use.

bw.fixed_cleaned <- 2377.371
bw.fixed_cleaned
[1] 2377.371

The optimal fixed bandwith distance to use it 2377.371m. Next we build the model using the distance bandwidth matrix built earlier.

gwlr.fixed_cleaned <- ggwr.basic(status ~distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = Osun_wp_sp,
             bw = bw.fixed_cleaned,
             family = "binomial",
             kernel = "gaussian",
             adaptive = FALSE,
             longlat = FALSE)
 Iteration    Log-Likelihood
=========================
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 

Next we check the model evaluation metrics

gwr.fixed_cleaned <- as.data.frame(gwlr.fixed_cleaned$SDF)
gwr.fixed_cleaned <- gwr.fixed_cleaned %>%
  mutate(most = ifelse(
    gwr.fixed_cleaned$yhat >= 0.5, T, F))
gwr.fixed_cleaned$y <- as.factor(gwr.fixed_cleaned$y)
gwr.fixed_cleaned$most <- as.factor(gwr.fixed_cleaned$most)
CM_cleaned <- confusionMatrix(data=gwr.fixed_cleaned$most, reference = gwr.fixed_cleaned$y)
CM_cleaned
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1833  268
     TRUE    281 2374
                                          
               Accuracy : 0.8846          
                 95% CI : (0.8751, 0.8935)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7661          
                                          
 Mcnemar's Test P-Value : 0.6085          
                                          
            Sensitivity : 0.8671          
            Specificity : 0.8986          
         Pos Pred Value : 0.8724          
         Neg Pred Value : 0.8942          
             Prevalence : 0.4445          
         Detection Rate : 0.3854          
   Detection Prevalence : 0.4418          
      Balanced Accuracy : 0.8828          
                                          
       'Positive' Class : FALSE           
                                          

By removing the insignificant variables using geographical weighted model, the sensitivity has improved to 86.7% from 86.3% while accuracy remains the same at 88.4%.

4. Visualization

Next we visualize the Local government areas water point functionality prediction. We set tm_dots to display only 2 categories - below 0.5 and above 0.5

Osun_wp_sf_selected <- Osun_wp_sf_clean %>%
  select(c(ADM2_EN, ADM2_PCODE,ADM1_EN, ADM1_PCODE, status))
gwr_sf.fixed <- cbind(Osun_wp_sf_selected, gwr.fixed_cleaned)
tmap_mode("view")

prob_t <- tm_shape(Osun) + 
  tm_polygons(alpha = 0.1) +
  tm_shape(gwr_sf.fixed) +
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd = 1,
          n=2) +
  tm_view(set.zoom.limits = c(8,14))

prob_t

We also plot the original graph with original values.

og_y <- tm_shape(Osun) + 
  tm_polygons(alpha = 0.1) +
  tm_shape(gwr_sf.fixed) +
  tm_dots(col = "y",
          border.col = "gray60",
          border.lwd = 1,
          n=2, 
          palette = "Blues") +
  tm_view(set.zoom.limits = c(8,14))

og_y

We put them side by side. As we can see there are more misclassifications in the North West region. The predicted values predicted more TRUE (False Positives) than Actual TRUE.

4.1 Tuning Threshold

One way to address it is to increase the threshold. This reduces the number of False Positive but increases the number of False Negative.

In this context, a model that predicts less False positive is preferred because it is a more conservative approach to monitor water point functionality. It will be worse off if taps that are not working are marked as working thus shortchaning it of timely repairs which in return deprives villagers from access to water.

tmap_arrange(prob_t,og_y)

We increase threshold to 0.6.

gwr.fixed_cleaned <- as.data.frame(gwlr.fixed_cleaned$SDF)
gwr.fixed_cleaned <- gwr.fixed_cleaned %>%
  mutate(most = ifelse(
    gwr.fixed_cleaned$yhat >= 0.6, T, F))
gwr.fixed_cleaned$y <- as.factor(gwr.fixed_cleaned$y)
gwr.fixed_cleaned$most <- as.factor(gwr.fixed_cleaned$most)
CM_cleaned <- confusionMatrix(data=gwr.fixed_cleaned$most, reference = gwr.fixed_cleaned$y)
CM_cleaned
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1976  472
     TRUE    138 2170
                                          
               Accuracy : 0.8717          
                 95% CI : (0.8619, 0.8811)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7443          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9347          
            Specificity : 0.8213          
         Pos Pred Value : 0.8072          
         Neg Pred Value : 0.9402          
             Prevalence : 0.4445          
         Detection Rate : 0.4155          
   Detection Prevalence : 0.5147          
      Balanced Accuracy : 0.8780          
                                          
       'Positive' Class : FALSE           
                                          
gwr_sf.fixed2 <- cbind(Osun_wp_sf_selected, gwr.fixed_cleaned)

prob_t_2 <- tm_shape(Osun) + 
  tm_polygons(alpha = 0.1) +
  tm_shape(gwr_sf.fixed2) +
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd = 1,
          breaks = c(0,0.6,1)) +
  tm_view(set.zoom.limits = c(8,14))


og_y_2 <- tm_shape(Osun) + 
  tm_polygons(alpha = 0.1) +
  tm_shape(gwr_sf.fixed2) +
  tm_dots(col = "y",
          border.col = "gray60",
          border.lwd = 1, 
          palette = "Blues") +
  tm_view(set.zoom.limits = c(8,14))

Increasing threshold has improved the sensitivity. Now we visualize it in a graph. The 2 graphs look more similar.

tmap_arrange(prob_t_2,og_y_2)

4.2 Visualising Metrics

We can also visualize the metrics of the variables. gwr_sf.fixed contains both original values (ends with SE) and t statistics value (TV).

tertiary_TV <- tm_shape(Osun) +
  tm_polygons(alpha = 0.1) +
  tm_shape(gwr_sf.fixed) +
  tm_dots(col = "distance_to_tertiary_road_TV", border.col = "gray60", border.lwd=4, palette = "Purples") +
  tm_view(set.zoom.limits = c(8,14))

tertiary_SE <- tm_shape(Osun) +
  tm_polygons(alpha = 0.1) +
  tm_shape(gwr_sf.fixed) +
  tm_dots(col = "distance_to_tertiary_road_SE", border.col = "gray60", border.lwd=4) +
  tm_view(set.zoom.limits = c(8,14))


tmap_arrange(tertiary_SE, tertiary_TV, asp=1, ncol=2, sync = TRUE)
town_TV <- tm_shape(Osun) +
  tm_polygons(alpha = 0.1) +
  tm_shape(gwr_sf.fixed) +
  tm_dots(col = "distance_to_town_TV", border.col = "gray60", border.lwd=4, palette = "Purples") +
  tm_view(set.zoom.limits = c(8,14))

town_SE <- tm_shape(Osun) +
  tm_polygons(alpha = 0.1) +
  tm_shape(gwr_sf.fixed) +
  tm_dots(col = "distance_to_town_SE", border.col = "gray60", border.lwd=4) +
  tm_view(set.zoom.limits = c(8,14))


tmap_arrange(town_SE, town_TV, asp=1, ncol=2, sync = TRUE)